{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"align-center\">\n",
    "<a href=\"https://oumi.ai/\"><img src=\"https://oumi.ai/docs/en/latest/_static/logo/header_logo.png\" height=\"200\"></a>\n",
    "\n",
    "[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)\n",
    "[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)\n",
    "[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)\n",
    "</div>\n",
    "\n",
    "👋 Welcome to Open Universal Machine Intelligence (Oumi)!\n",
    "\n",
    "🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](hhttps://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.\n",
    "\n",
    "🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.\n",
    "\n",
    "⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluation with Alpaca Eval 2.0\n",
    "\n",
    "This notebook discusses how you can run E2E evaluations for your trained model, using Oumi inference for generating the responses, and [Alpaca Eval 2.0](https://github.com/tatsu-lab/alpaca_eval) for automatically calculating the win-rates vs. GPT4 Turbo (or other reference models of your choice)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites and Configuration\n",
    "\n",
    "First, start by installing the [Alpaca Eval package](https://pypi.org/project/alpaca-eval/) and LlamaCPP (for inference):\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -U -q alpaca_eval llama-cpp-python"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When comparing your model's responses vs. the reference responses to calculate the win rates, an annotator (judge) is needed. By default, the annotator is set to GPT4 Turbo (annotator config: [weighted_alpaca_eval_gpt4_turbo](https://github.com/tatsu-lab/alpaca_eval?tab=readme-ov-file#alpacaeval-20)). To access the latest GPT-4 models, including GPT4 Turbo, an Open API key is required. Details on creating an OpenAI account and generating a key can be found at [OpenAI's quickstart webpage](https://platform.openai.com/docs/quickstart)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"\"  # Set your OpenAI API key here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>⚠️ Cost considerations</b>: The cost of running a standard Alpaca evaluation 2.0 (with [weighted_alpaca_eval_gpt4_turbo](https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/evaluators_configs/README.md) config) and annotating 805 examples with GPT4 Turbo is <b>$3.5</b>. However, the sample code of this notebook only annotates 3 (of 805 examples) and costs less than <b>0.5¢</b>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "NUM_EXAMPLES = 3  # Replace with 805 for full dataset evaluation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define your model and the max number of tokens it supports (to be used during generation). You can point to any model in HuggingFace, provide a path to a local folder that contains your model, or any other model format that Oumi inference supports. Also, please provide a (human friendly) display name for your model, to be used when displayed in leaderboards. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "MODEL_NAME = \"bartowski/Llama-3.2-1B-Instruct-GGUF\"\n",
    "MODEL_DISPLAY_NAME = \"MyLlamaTestModel\"\n",
    "MODEL_MAX_TOKENS = 8192"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we'll create a tutorial directory to store our results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "tutorial_dir = \"alpaca_eval_tutorial\"\n",
    "\n",
    "Path(tutorial_dir).mkdir(parents=True, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Retrieve Alpaca dataset\n",
    "\n",
    "Alpaca Eval 2.0 requires model responses for the [tatsu-lab/alpaca_eval](https://huggingface.co/datasets/tatsu-lab/alpaca_eval) dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[2025-01-16 16:02:59,880][oumi][rank0][pid:79891][MainThread][INFO]][base_map_dataset.py:68] Creating map dataset (type: AlpacaEvalDataset)...\n",
      "[2025-01-16 16:03:00,656][oumi][rank0][pid:79891][MainThread][INFO]][base_map_dataset.py:470] Dataset Info:\n",
      "\tSplit: eval\n",
      "\tVersion: 1.0.0\n",
      "\tDataset size: 554496\n",
      "\tDownload size: 620778\n",
      "\tSize: 1175274 bytes\n",
      "\tRows: 805\n",
      "\tColumns: ['instruction', 'output', 'generator', 'dataset']\n",
      "[2025-01-16 16:03:00,737][oumi][rank0][pid:79891][MainThread][INFO]][base_map_dataset.py:408] Loaded DataFrame with shape: (805, 4). Columns:\n",
      "instruction    object\n",
      "output         object\n",
      "generator      object\n",
      "dataset        object\n",
      "dtype: object\n"
     ]
    }
   ],
   "source": [
    "from oumi.datasets.evaluation import AlpacaEvalDataset\n",
    "\n",
    "alpaca_dataset = AlpacaEvalDataset(dataset_name=\"tatsu-lab/alpaca_eval\").conversations()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since this notebook contains sample code, we will only run inference for the first `NUM_EXAMPLES` (of 805) from the dataset. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 [USER: What are the names of some famous actors that started their careers on Broadway?]\n",
      "1 [USER: How did US states get their names?]\n",
      "2 [USER: Hi, my sister and her girlfriends want me to play kickball with them. Can you explain how the game is played, so they don't take advantage of me?]\n"
     ]
    }
   ],
   "source": [
    "alpaca_dataset = alpaca_dataset[:NUM_EXAMPLES]  # For testing purposes, reduce examples.\n",
    "\n",
    "for index, conversation in enumerate(alpaca_dataset):\n",
    "    print(index, conversation.messages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Run inference\n",
    "\n",
    "First, define all the relevant parameters and configs required for inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from oumi.core.configs import GenerationParams, InferenceConfig, ModelParams\n",
    "\n",
    "generation_params = GenerationParams(max_new_tokens=MODEL_MAX_TOKENS)\n",
    "model_params = ModelParams(model_name=MODEL_NAME, model_max_length=MODEL_MAX_TOKENS)\n",
    "inference_config = InferenceConfig(model=model_params, generation=generation_params)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, choose an inference engine that your model is compatible with. For more information on this, see Oumi's [inference documentation](https://oumi.ai/docs/en/latest/user_guides/infer/infer.html). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[2025-01-16 16:03:01,819][oumi][rank0][pid:79891][MainThread][INFO]][llama_cpp_inference_engine.py:118] Loading model from Huggingface Hub: bartowski/Llama-3.2-1B-Instruct-GGUF.\n",
      "llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized\n"
     ]
    }
   ],
   "source": [
    "from oumi.inference import LlamaCppInferenceEngine\n",
    "\n",
    "inference_engine = LlamaCppInferenceEngine(model_params)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, run inference to get responses from your model for the prompts contained in the `alpaca_dataset`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "69d56a10165e43e3bb195d7d3fcf55c1",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/3 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "responses = inference_engine.infer(alpaca_dataset, inference_config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, convert the responses from Oumi format (list of `Conversation`s) to Alpaca format (list of `dict`s, where the data is contained under the keys `instruction` and `output`). Create a DataFrame from the data and add a new column \"`generator`\", which captures the human-readable name of the model the responses were produced with. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "from oumi.datasets.evaluation import utils\n",
    "\n",
    "responses_json = utils.conversations_to_alpaca_format(responses)\n",
    "responses_df = pd.DataFrame(responses_json)\n",
    "responses_df[\"generator\"] = MODEL_DISPLAY_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your DataFrame should look as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>instruction</th>\n",
       "      <th>output</th>\n",
       "      <th>generator</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>What are the names of some famous actors that ...</td>\n",
       "      <td>There are many famous actors who started their...</td>\n",
       "      <td>MyLlamaTestModel</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>How did US states get their names?</td>\n",
       "      <td>The origin of US state names is a fascinating ...</td>\n",
       "      <td>MyLlamaTestModel</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Hi, my sister and her girlfriends want me to p...</td>\n",
       "      <td>Kickball is a fun team sport that's easy to le...</td>\n",
       "      <td>MyLlamaTestModel</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                         instruction  \\\n",
       "0  What are the names of some famous actors that ...   \n",
       "1                 How did US states get their names?   \n",
       "2  Hi, my sister and her girlfriends want me to p...   \n",
       "\n",
       "                                              output         generator  \n",
       "0  There are many famous actors who started their...  MyLlamaTestModel  \n",
       "1  The origin of US state names is a fascinating ...  MyLlamaTestModel  \n",
       "2  Kickball is a fun team sport that's easy to le...  MyLlamaTestModel  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "responses_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Run Alpaca Eval 2.0\n",
    "\n",
    "You can kick off evaluations as shown below. \n",
    "\n",
    "The default annotator for Alpaca Eval 2.0 is <b>GPT-4 Turbo</b>. While Alpaca Eval 1.0 was using a binary preference, Alpaca Eval 2.0 uses the logprobs to compute a continuous preference, resulting in a <b>weighted</b> win-rate. The default annotator config of Alpaca Eval 2.0 is thus `weighted_alpaca_eval_gpt4_turbo`. There is an option to use other annotators (judges) as well; see the [Annotators configs](https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/evaluators_configs/README.md) page for details and relevant costs. However, the Alpaca 2.0 leaderboard is established with GPT4 Turbo as the reference annotator. Using other annotators is less informative if you are interested in generating comparative results. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:root:Evaluating the MyLlamaTestModel outputs.\n",
      "INFO:root:Creating the annotator from `weighted_alpaca_eval_gpt4_turbo`.\n",
      "INFO:root:Saving annotations to `/opt/miniconda3/envs/oumi/lib/python3.11/site-packages/alpaca_eval/evaluators_configs/weighted_alpaca_eval_gpt4_turbo/annotations_seed0_configs.json`.\n",
      "WARNING:root:The length of outputs before and after merge are not the same. We have len(outputs_1)==805, len(outputs_2)==3, and len(df_annotated)==3. This means that there are missing examples or duplicates. We are taking a SQL inner join.\n",
      "Annotation chunk:   0%|          | 0/1 [00:00<?, ?it/s]INFO:root:Annotating 3 examples with weighted_alpaca_eval_gpt4_turbo\n",
      "INFO:root:Using `openai_completions` on 3 prompts using gpt-4-1106-preview.\n",
      "INFO:root:Kwargs to completion: {'model': 'gpt-4-1106-preview', 'temperature': 1, 'logprobs': True, 'top_logprobs': 5, 'is_chat': True}. num_procs=5\n",
      "WARNING:root:/opt/miniconda3/envs/oumi/lib/python3.11/client_configs/openai_configs.yaml wasn't found. We are using environment variables to construct the client configs.This is the old and non-recommended way of doing it. Please see `client_configs/README.md` for the recommended way of specifying client configs.\n",
      "WARNING:root:/opt/miniconda3/envs/oumi/lib/python3.11/client_configs/openai_configs.yaml wasn't found. We are using environment variables to construct the client configs.This is the old and non-recommended way of doing it. Please see `client_configs/README.md` for the recommended way of specifying client configs.\n",
      "WARNING:root:/opt/miniconda3/envs/oumi/lib/python3.11/client_configs/openai_configs.yaml wasn't found. We are using environment variables to construct the client configs.This is the old and non-recommended way of doing it. Please see `client_configs/README.md` for the recommended way of specifying client configs.\n",
      "INFO:root:Using OAI client number 1 out of 1.\n",
      "INFO:root:Using OAI client number 1 out of 1.\n",
      "INFO:root:Using OAI client number 1 out of 1.\n",
      "INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
      "INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
      "INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
      "prompt_batches: 100%|██████████| 3/3 [00:01<00:00,  1.97it/s]\n",
      "INFO:root:Completed 3 examples in 1.5 seconds.\n",
      "INFO:root:Saving all annotations to /opt/miniconda3/envs/oumi/lib/python3.11/site-packages/alpaca_eval/evaluators_configs/weighted_alpaca_eval_gpt4_turbo/annotations_seed0_configs.json.\n",
      "Annotation chunk: 100%|██████████| 1/1 [00:01<00:00,  1.55s/it]\n",
      "INFO:root:Saving all results to alpaca_eval_tutorial/weighted_alpaca_eval_gpt4_turbo\n",
      "INFO:root:Saving result to the precomputed leaderboard at /opt/miniconda3/envs/oumi/lib/python3.11/site-packages/alpaca_eval/leaderboards/data_AlpacaEval_2/weighted_alpaca_eval_gpt4_turbo_leaderboard.csv\n"
     ]
    }
   ],
   "source": [
    "from alpaca_eval import evaluate\n",
    "\n",
    "ANNOTATORS_CONFIG = \"weighted_alpaca_eval_gpt4_turbo\"\n",
    "\n",
    "df_leaderboard, annotations = evaluate(  # type: ignore\n",
    "    model_outputs=responses_df,\n",
    "    annotators_config=ANNOTATORS_CONFIG,\n",
    "    is_return_instead_of_print=True,\n",
    "    output_path=tutorial_dir,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Inspect the metrics\n",
    "\n",
    "Once the evaluation process completes, you can inspect the metrics produced, as shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Metrics for `MyLlamaTestModel`\n",
      " - win_rate=0.0016009055697541186\n",
      " - standard_error=0.0007292077882424806\n",
      " - n_wins=0\n",
      " - n_wins_base=3\n",
      " - n_draws=0\n",
      " - n_total=3\n",
      " - discrete_win_rate=0.0\n",
      " - mode=community\n",
      " - avg_length=2192\n",
      " - length_controlled_winrate=0.051919831641191114\n",
      " - lc_standard_error=0.013651061718406078\n"
     ]
    }
   ],
   "source": [
    "metrics = df_leaderboard.loc[MODEL_DISPLAY_NAME]\n",
    "\n",
    "print(f\"Metrics for `{MODEL_DISPLAY_NAME}`\")\n",
    "for metric, value in metrics.items():\n",
    "    print(f\" - {metric}={value}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Optional] Retain your configuration for reproducibility\n",
    "\n",
    "In order to be able to repro your evaluation run in the future, do not forget to save the configuration of your evaluation, together with your evaluation metrics. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from importlib.metadata import version\n",
    "\n",
    "evaluation_config_dict = {\n",
    "    \"packages\": {\n",
    "        \"alpaca_eval\": version(\"alpaca_eval\"),\n",
    "        \"oumi\": version(\"oumi\"),\n",
    "    },\n",
    "    \"configs\": {\n",
    "        \"inference_config\": str(inference_config),\n",
    "        \"annotators_config\": ANNOTATORS_CONFIG,\n",
    "    },\n",
    "    \"eval_metrics\": metrics.to_dict(),\n",
    "}\n",
    "\n",
    "evaluation_config_json = json.dumps(evaluation_config_dict, indent=2)\n",
    "with open(f\"{tutorial_dir}/evaluation_config.json\", \"w\") as output_file:\n",
    "    output_file.write(evaluation_config_json)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "oumi",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
